Cppless: Productive and Performant Serverless Programming in C++
The rise of serverless introduced a new class of scalable, elastic and highly
available parallel workers in the cloud. Many systems and applications benefit
from offloading computations and parallel tasks to dynamically allocated
resources. However, developers of C++ applications find it difficult to
integrate serverless functions due to complex deployment, incompatibilities
between client and cloud environments, and loosely typed input and output data. To
enable single-source and efficient serverless acceleration in C++, we introduce
Cppless, an end-to-end framework for implementing serverless functions which
handles the creation, deployment, and invocation of functions. Cppless is built
on top of LLVM and requires only two compiler extensions to automatically
extract C++ function objects and deploy them to the cloud. We demonstrate that
offloading parallel computations from a C++ application to serverless workers
can provide up to 30x speedup, requiring only minor code modifications and
costing less than one cent per computation.
rFaaS: RDMA-Enabled FaaS Platform for Serverless High-Performance Computing
The rigid MPI programming model and batch scheduling dominate
high-performance computing. While clouds brought new levels of elasticity into
the world of computing, supercomputers still suffer from low resource
utilization rates. To enhance supercomputing clusters with the benefits of
serverless computing, a modern cloud programming paradigm for pay-as-you-go
execution of stateless functions, we present rFaaS, the first RDMA-aware
Function-as-a-Service (FaaS) platform. With hot invocations and decentralized
function placement, we overcome the major performance limitations of FaaS
systems and provide low-latency remote invocations in multi-tenant
environments. We evaluate the new serverless system through a series of
microbenchmarks and show that remote functions execute with negligible
performance overheads. We demonstrate how serverless computing can bring
elastic resource management into MPI-based high-performance applications.
Overall, our results show that MPI applications can benefit from modern cloud
programming paradigms to guarantee high performance at lower resource costs.
Bridging Control-Centric and Data-Centric Optimization
With the rise of specialized hardware and new programming languages, code
optimization has shifted its focus towards promoting data locality. Most
production-grade compilers adopt a control-centric mindset (instruction-driven
optimization augmented with scalar-based dataflow), whereas other approaches
provide domain-specific and general-purpose data movement minimization, which
can miss important control-flow optimizations. As the two representations are
not commutable, users must choose one over the other. In this paper, we explore
how both control- and data-centric approaches can work in tandem via the
Multi-Level Intermediate Representation (MLIR) framework. Through a combination
of an MLIR dialect and specialized passes, we recover parametric, symbolic
dataflow that can be optimized within the DaCe framework. We combine the two
views into a single pipeline, called DCIR, showing that it is strictly more
powerful than either view. On several benchmarks and a real-world application
in C, we show that our proposed pipeline consistently outperforms MLIR and
automatically uncovers new optimization opportunities with no additional
effort.
Comment: CGO'2
User-guided Page Merging for Memory Deduplication in Serverless Systems
Serverless computing is an emerging cloud paradigm that offers an elastic and
scalable allocation of computing resources with pay-as-you-go billing. In the
Function-as-a-Service (FaaS) programming model, applications comprise
short-lived and stateless serverless functions executed in isolated containers
or microVMs, which can quickly scale to thousands of instances and process
terabytes of data. This flexibility comes at the cost of duplicated runtimes,
libraries, and user data spread across many function instances, and cloud
providers do not utilize this redundancy. The memory footprint of serverless
functions forces the removal of idle containers to make space for new ones, which
decreases performance through more cold starts and fewer data caching opportunities. We
address this issue by proposing deduplicating memory pages of serverless
workers with identical content, based on the content-based page-sharing concept
of Linux Kernel Same-page Merging (KSM). We replace the background memory
scanning process of KSM, as it is too slow to locate sharing candidates in
short-lived functions. Instead, we design User-Guided Page Merging (UPM), a
built-in Linux kernel module that leverages the madvise system call: we enable
users to advise the kernel of memory areas that can be shared with others. We
show that UPM reduces memory consumption by up to 55% on 16 concurrent
containers executing a typical image recognition function, more than doubling
the density of containers of the same function that can run on a system.
Comment: Accepted at IEEE BigData 202
Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems require either a full-factorial experiment design or a sampling design based on heuristics, resulting in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repeating measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective in two case studies, reducing the number of measurements from up to 3125 to only 25 and thus the cost to only 2.9% of the previously needed core hours, while maintaining the accuracy of the resulting model at 91.5%, compared to 93.8% when using all 3125 measurements.
Automatic Empirical Performance Modeling of Parallel Programs
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes or solving larger problems. Often, such performance bugs manifest themselves only when the code is put into production, a point where remediation can be difficult. Manually creating analytical performance models provides insights into optimization opportunities but is extremely costly if done for applications of realistic size. The effort limits application developers to only attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. Furthermore, tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible.
If we have to consider multiple performance-relevant parameters and their possible interactions at the same time, a common requirement in many situations, this task becomes even more complex.
The initial contribution of this thesis is a method to substantially improve both the coverage and the speed of performance modeling and analysis. By automatically generating an empirical performance model for each part of a parallel program with respect to the variation of a relevant parameter, such as process count or problem size, it becomes possible to easily identify those parts that will degrade performance at larger core counts or when solving bigger problems.
In the next step, we extended the approach with a method capable of modeling any combination of multiple execution parameters simultaneously, provided sufficient performance measurements are available. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. Specialized heuristics developed as part of this work traverse the search space rapidly and generate insightful performance models that enable a wide range of uses from performance predictions for balanced machine design to performance tuning.
Finally, we present a method that employs automated performance modeling to quickly predict application requirements for varying scales and problem sizes. Following this approach, it is possible to determine the future requirements of major scientific applications, derive an optimization strategy, and illustrate system design trade-offs in the light of those requirements. This supports the co-design process by informing hardware acquisition decisions with the actual needs of the software.
The methods described in this work are implemented in the performance analysis tool Extra-P. Extra-P has been released as open source and has been successfully used to gain insight into the performance of numerous scientific applications from a large range of fields.
Since its release, Extra-P has had an impact on the HPC community. Developers at universities and research centers alike have used Extra-P to better understand the performance of their research codes.
Tutorials on the use of Extra-P have been offered at international conferences such as EuroMPI and Supercomputing, further demonstrating the effectiveness of this approach in making performance modeling available to developers without requiring expert knowledge of the topic.
This work simplifies and streamlines the performance modeling process, offering insights into application behavior quickly and automatically and allowing the developer to focus on transforming these insights into tangible performance improvements.